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Abstract 

We discuss a technique that allows blind recovery of signals or blind identification of mixtures in instances where 
such recovery or identification were previously thought to be impossible: (i) closely located or highly correlated 
sources in antenna array processing, (ii) highly correlated spreading codes in CDMA radio communication, (iii) 
nearly dependent spectra in fluorescent spectroscopy. This has important implications — in the case of antenna array 
processing, it allows for joint localization and extraction of multiple sources from the measurement of a noisy mixture 
recorded on multiple sensors in an entirely deterministic manner. In the case of CDMA, it allows the possibility of 
having a number of users larger than the spreading gain. In the case of fluorescent spectroscopy, it allows for detection 
of nearly identical chemical constituents. The proposed technique involves the solution of a bounded coherence low- 
rank tensor approximation problem. We show that bounded coherence allows us to establish existence and uniqueness 
of the recovered solution. We will provide some statistical motivation for the approximation problem and discuss 
greedy approximation bounds. To provide the theoretical underpinnings for this technique, we develop a corresponding 
theory of sparse separable decompositions of functions, including notions of rank and Schatten norms that specialize 
to the usual one for matrices and operators but applies to also hypermatrices and tensors. 



I. Introduction 

There are two simple ideas for reducing the complexity or dimension of a problem that are widely applicable 
because of their simplicity and generality: 

• Sparsity: resolving a complicated entity, represented by a function /, into a sum of a small number of simple 
or elemental constituents: 

r 

v =\ 

• Separability: decoupling a complicated entity, represented by a function g, that depends on multiple factors 
into a product of simpler constituents, each depending only on one factor: 

d 

,g(xi, ...,x d ) = JJ <^fc(x fe ). 
fc=i 

The two ideas underlie some of the most useful techniques in engineering and science — Fourier, wavelets, and 
other orthogonal or sparse representations of signals and images, singular value and eigenvalue decompositions of 
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matrices, separation-of- variables, Fast Fourier Transform, mean field approximation, etc. This article examines the 
model that combines these two simple ideas: 

r d 

/(x!,...,x d ) = ^a p JJ^fcp(x fc ), (1) 

P =i k=i 

and we are primarily interested in its inverse problem, i.e. identification of the factors (pk p based on noisy 
measurements of /. We shall see that this is a surprisingly effective method for a wide range of identification 
problems. 

Here / is approximately encoded by r scalars, a. — (ai, . . . ,a r ) £ C r , and dr functions, ipk p , k = 1, ...,d; 
p = 1, . . . , r. Since d and r are both assumed to be small, we expect ([T} to be a very compact, possibly approximate, 
representation of /. We will assume that all these functions live in some Hilbert spaces and that tp kp are of unit 
norm (clearly possible since the norm of cff. p can be 'absorbed into' the coefficient a p in ([TJ). 

Let Hk — max p ^ g |((/?fcp, tp kq )\ and define the relative incoherence tuk = (1 — Mfe)/ fJ-k for k = 1, . . . ,d. Note that 
/i/c G [0, 1] and u>k € [0,oo]. We will show in this article that if d > 3, and 



Y," k > 2r-l, 



(2) 



fe=i 



then the decomposition in ([T]i is essentially unique and sparsest possible, i.e. r is minimal. Hence we may in 
principle identify tp kp based only on measurements of the mixture /. 

One of the keys in the identifiability requirement is that d > 3 or otherwise (when d — 1 or 2) the result would 
not hold. We will show that the condition d > 3 however leads to a difficulty (that does not happen when d = 1 or 
2). Since it is rarely, if not never, the case that one has the exact values of /, the decomposition ([1} is only useful 
in an idealized scenario. In reality, one has / = / + e, an estimate of / corrupted by noise e. Solving the inverse 
problem to ([T]) would require that we solve a best approximation problem. For example, with the appropriate noise 
models (see Section |Vj, the best approximation problem often takes the form 



argmm 



/ a v n 

p=l k = l 



(3) 



with || • || an L 2 -norm. Now the trouble is that when d > 3, this best approximation problem may not have a solution 



— because the infimum of the loss function is unattainable in general, as we will discuss in Section VIII-A In 
view of this, our next result is that when 

d 

JJ(l+o; fc )>r, (4) 



k=l 



the infimum in ([3]) is always attainable, thereby alleviating the aforementioned difficulty. A condition that meets 
both |2| and Q is easy to obtain because of the arithmetic-geometric mean inequality 



ri( i+ ^) 



k=l 



nVd d 

k=l 



Wfc. 
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II. Sparse separable decompositions 

The notion of sparsity dates back to harmonic analysis fl64l . ITD . Il50l and approximation theory l65l . and has 
received a lot of recent attention in compressive sensing ESI . ifTOl . Il28l . lfT31 . The notion of separability is also 
classical — the basis behind the separation-of- variables technique in partial differential equations [6] and special 
functions [53 1, fast Fourier transforms on arbitrary groups BTl . mean field approximations in statistical physics 
(44), and the naive Bayes model in machine learning \5\, B6l . We describe a simple model that incorporates the 
two notions. 

The function / : X — > C or K to be resolved into simpler entities will be referred to as our target function. 
We will treat the discrete (X is finite or countably infinite) and continuous (X is a continuum) cases on an equal 
footing. The discrete cases are when / is a vector (if X = [ni] = {1, . . . , ni}), a matrix (if X = [n\] x [712]), a 
hypermatrix (if X = [m] x [712] x ■ • • x [rid]), while the usual continuous cases are when / is a function on some 
domain X = fl C M. m or C m . In the discrete cases, the set of target functions under consideration are identified 
with C" 1 , C" lX ™ 2 , C™ixn 2 x - xn d respectively whereas in the continuous cases, we usually impose some additional 
regularity structures such integrability or differentability, so that the set of target functions under consideration are 
L 2 (f2) or C°°(f2) or H k (fl) = W k ' 2 (Q), etc. We will only assume that the space of target functions is a Hilbert 
space. Note that the requirement d > 3 implies that / is at least a 3-dimensional hypermatrix in discrete case or a 
function of at least three continuous variables, i.e. m > 3, in the continuous case. The identifiability does not work 
for (usual 2-dimensional) matrices or bivariate functions. With ([TJ in mind, we will call / a d-partite or multipartite 
function if we wish to partition its arguments into d blocks of variables. 

We will briefly examine the decompositions and approximations of our target function into a sum or integral of 
separable functions, adopting a tripartite notation for simplicity. There are three cases: 

• Continuous: 

/(x,y,z)= f 0(x,tMy,t)V(z,t)dz/(t). (5) 
Jt 

Here we assume that v is some given Borel measure and that T is compact. 
. Semidiscrete: 

r 

/(x, y, z) = ^2 ^p( x )^p(y)V-'p(z)- (6) 
This may be viewed as a discretization of the continuous case in the t variable, i.e. p (x) = £?(x, t p ), <^ P (y) = 

y(y,t p ), Vv( z ) = V'OMp)- 

. Discrete: 

r 

P =l 

This may be viewed as a further discretization of the semidiscrete case, i.e. a,^ = /(xj, yj, Zfe), Ui P = ^ p (xj), 
Vj p = <P P (yj), w kp = ij} p (z k ). 
It is clear that when k take finitely many values, the discrete decomposition |7| is always possible with a 
finite r since the space is of finite dimension. If i,j, k could take infinitely many values, then the finiteness of 
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r requires that equality be replaced by approximation to any arbitrary precision e > in some suitable norm. 
This follows from the following observation about the semidiscrete decomposition: The space of functions with a 
semidiscrete representation as in |6]), with r finite, is dense in C°(fl), the space of continuous functions. This is just 
a consequence of the Stone-Weierstrass theorem [21 1. Discussion of the most general case Q would require us to 
go into integral operators, which we will not do as in the present framework we are interested in applications that 
rely on the inverse problems corresponding to (|6) and |7|. Nonetheless |5]) is expected to be useful and we state it 
here for completeness. Henceforth, we will simply refer to |6| or |7]i as a multilinear decomposition, by which we 
mean a decomposition into a linear combination of separable functions. We note here that the finite-dimensional 



discrete version has been studied under several different names — see Section IX Our emphasis in this paper 
is the semidiscrete version |6]) that applies to multipartite functions on arbitrary domains and are not necessarily 
finite-dimensional. As such, we will frame almost all of our discussions in terms of the semidiscrete case, which 
of course also includes the discrete version |7} as a special case (when x, y, z take only finite discrete values). 

Example 1 (Mixture of Gaussians). Multilinear decompositions arise in many contexts. For example, in machine 
learning and nonparametric statistics, a fact of note is that Gaussians are separable 

exp(a; 2 + y 2 + z 2 ) — exp(a; 2 ) exp(y 2 ) exp(z 2 ). 

More generally for symmetric positive-definite A S R nx ™ with eigenvalues A = diag(Ai, . . . , A n ), 

n 

exp(x T Ax) = exp(z T Az) = J^J cxp(A.j;z 2 ), 

i=l 

under a linear change of coordinates z = Q T x where A = QAQ 1 . Hence, Gaussian mixture models of the form 

m 

/( x ) = a i exp t( x ~ /^) Ta j"( x _ 

where AiAj = AjAi for all i ^ j (and therefore Ax,...,A m have a common eigenbasis) may likewise be 
transformed with a suitable linear change of coordinates into a multilinear decomposition as in |6]). 

We will later see several more examples from signal processing, telecommunications, and spectroscopy. 
A. Modeling 

The multilinear decomposition — an additive decomposition into multiplicatively decomposable components — is 
extremely simple but models a wide range of phenomena in signal processing and spectroscopy. The main message 
of this article is that the corresponding inverse problem — recovering the factors 8 p ,ip p , ip p from noisy measurements 
of / — can be solved under mild assumptions and yields a class of techniques for a range of applications (cf. 



Section IX i that we shall collectively call multilinear identification. We wish to highlight in particular that multilinear 
identification gives a determistic approach for solving the problem of joint localization and estimation of radiating 
sources with short data lengths. Previous approaches based on cumulants ifTHIl require not only longer data lengths 
but also sources to be statistically independent. 
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The experienced reader would probably guess that such a powerful technique must be fraught with difficulties 
and he would be right. The inverse problem to |6]), like most other inverse problems, faces issues of existence, 
uniqueness, and computability. The approximation problem involved can be ill-posed in the worst possible way (cf. 



Section III I. Fortunately, prompted by study of the restricted isometry property in compressed sensing (interpreted 
in a broad sense, encompassing not only the ideas covered in iflOl . |9l , (15), GBI . 11261 . [32] but also in [7|, [8|, 
ll33l ). we will show here that mild assumptions on coherence could allow one to overcome most of these 



III. Finite rank multipartite functions 



difficulties (cf. Section [VTTTt > 



In this section, we will discuss the notion of rank, which measures the sparsity of a multilinear decomposition, 
and the notion of Kruskal rank, which measures the uniqueness of a multilinear decomposition in a somewhat more 
restrictive sense. Why is uniqueness important? It can be answered in one word: Identifiability. More specifically, a 
unique decomposition means that we may in principle identify the factors. To be completely precise, we will first 
need to define the terms in the previous sentence, namely, 'unique', 'decomposition', and 'factor'. Before we do 
that, we will introduce the tensor product notation. It is not necessary to know anything about tensor product of 
Hilbert spaces to follow what we present below. We shall assume that all our Hilbert spaces are separable and so 
there is no loss of generality in assuming at the outset that they are just L 2 (X) for some cr-finite X. 

Let Xi, . . . ,Xd be cr-finite measurable spaces. There is a natural Hilbert space isomorphism 

L 2 {X l x • • • x X d ) 5* L 2 ^) ® • • ■ ® L 2 (X d ). (8) 
In other words, every d-partite i 2 -function / : X% X • • • X X d — > C may be expressed a^] 

oo 

/(xi,...,x d ) =^2ipi p (xi) ■ ■ ■ (p dp (x d ), (9) 
p=l 

with tfkp G L 2 (Xk). The tensor product of functions (pi € L 2 (Xi), L 2 (Xd) is denoted by ipi <g> ■ ■ ■ (g) <p d 

and is the function in L 2 [X\ x • • ■ x Xd) defined by 

IfX ® • • • ® ip d (x.i, . . . , Xrf) = Pl(Xi) • • • lfid(x d ). 

With this notation, we may rewrite ([9| as 

oo 

/ = yi P ® • • • ® <Pd P - 
P =i 

A point worth noting here is that: 

"Multipartite functions are infinite-dimensional tensors." 

'Point values of L p -functions are undefined in general. So equations like these are taken to implicitly mean almost everywhere. Anyway all 
functions that arise in our applications will at least be continuous and so this is really not a point of great concern. 
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Finite-dimensional tensors are simply the special case when X±, . .. ,Xd are all finite sets (see Example RS. In 
particular, a multivariate function^] / e L 2 (R d ) is a an infinite-dimensional tensor that can expressed as an infinite 
sum of a tensor product of ipi p , . . . , ipdp <E L 2 (M) and L 2 (R d ) = L 2 (R) <g> • • • <£> L 2 (E). We shall have more to say 
about this later in conjunction with Kolmogorov's superposition principle for multivariate functions. 

In this paper, functions having a finite decomposition will play a central role; for these we define 

rank(/) := min jr G N : / = ^ tpi p ® • • • ® ^ dp | (10) 

provided / 7^ 0. The zero function is defined to have rank and we say rank(/) = 00 if such a decomposition is 
not possible. 

We will call a function / with rank(/) < r a rank-r function. Such a function may be written as a sum of r 
separable functions but possibly fewer. A decomposition of the form 

r 

f = ^2 Vl P ® ' ' ' ® Vdp (11) 

p=l 

will be called an rank-r multilinear decomposition. Note that the qualificative 'rank-r' will always mean 'rank not 
more than r'. If we wish to refer to a function / with rank exactly r, we will just specify that rank(/) = r. In this 
case, the rank-r multilinear decomposition in ( fTT| is of mininum length and we call it a rank revealing multilinear 
decomposition of /. 

A rank-1 function is both non-zero and decomposable, i.e., of the form ip% ® • • • ® (pd where ipk € L 2 (Xk). This 
agrees precisely with the notion of a separable function. Observe that the inner product (and therefore the norm) 
on L 2 (Xi x • ■ ■ x Xd) of a rank-1 function splits into a product 

(tpx ® ■ ■ ■ ® (fd,ipi <8> ■ ■ ■ ® ipd) = (fi, tpi)i ■ ■ ■ (<Pd, i>d)d (12) 

where (-,-) p denotes the inner product of L 2 (X p ). This inner product extends linearly to finite-rank elements of 

L 2 (X 1 x • ■ ■ x Xd): for / = Y^=i Vip ® * * • ® and g = ^)* =1 ^ ® • • • ® we have 

r,s 

(/»»)= ^2 (flp^lq)l - ■ ■ (fdp,1pdq)d- 
p,q=l 

In fact this is how a tensor product of Hilbert spaces (the right hand side of <j8j) is usually defined, namely, as the 
completion of the set of finite-rank elements of L 2 (X\ x • ■ ■ x Xd) under this inner product. 

When X\, . . . , Xd are finite sets, then all functions in L 2 {X\ x • • • x Xd) are of finite rank (and may in fact be 
viewed as hypermatrices or tensors as discussed in Section|n|i. Otherwise there will be functions in L 2 (X\ x ■ • • x Xd) 
of infinite rank. However, since we have assumed that Xi, . . . ,Xd are cr-finite measurable spaces, the set of all 
finite-rank / will always be dense in L 2 {X\ x • • • x Xd) by the Stone- Weierstrass theorem. 

2 We clarify our terminologies: A multipartite function is one for which the arguments xi, . . . , Xd can come from any X\ , . . . , Xd but a 
multivariate function, in the usual sense of the word, is one where X 1 , . . . , Xd are (measurable) subsets of R. So the former is a more general 
notion. 
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The next statement is a straightforward observation about rank-revealing multilinear decompositions of finite-rank 
functions but since it is central to this article we state it as a theorem. It is also tempting to call the decompo- 
sition a 'singular value decomposition' given its similarities with the usual matrix singular value decomposition 
(cf. Example [4]l. 

Theorem 2 ('Singular value decomposition' for multipartite functions). Let f E L 2 (Xi x ■ ■ • x X d ) be of finite 
rank. Then there exists a rank-r multilinear decomposition 

r 



f = y , Vp<pip ®---®<pd P (13) 

such that 



P =i 



r = rank(/), (14) 

the functions ipkp € L 2 (X p ) are of unit norm, 

\\fk P \\ = 1 for all k = 1,. . . ,d, p = l,...,r, (15) 
the coefficients o\ , . . . , o r are real positive, and 

o x > cr 2 > • • • > oy > 0. (16) 

Proof: This requires nothing more than rewriting the sum in ( fTTj i as a linear combination with the positive 
o-p's accouning for the norms of the summands and then re-indexing them in descending order of magnitudes. ■ 
While the usual singular value decomposition of a matrix would also have properties ( |14) , ( |15) , and ( [To} , the 
one crucial difference here is that our 'singular vectors' (pui, ■ ■ ■ , fkr in ( fT3| l will only be of unit norms but will 
not in general be orthonormal. Given this, we will not expect properties like the Eckhart- Young theorem, or that 



u\ H h of = ||/|| 2 , etc, to hold for ( fT3) (cf. Section VI for more details). 

One may think of the multilinear decomposition (\3\ as being similar in spirit, although not in substance, to 
Kolmogorov's superposition principle [39); the main message of which is that: 
"There are no true multivariate functions." 

More precisely, the principle states that continuous functions in multiple variables can be expressed as a compo- 
sition of a univariate function with other univariate functions. For readers not familiar with this remarkable result, 
we state a version of it here due to Kahane [38] 



Theorem 3 (Kolmogorov superposition). Let f : [0, l] d — > R be continuous. Then there exists constants Ai, . . . , G 
K and Lipschitz continuous functions ipi, . . . , ip^ : [0, 1] — > [0, 1] such that 

2d+l 

f(xi, ...,x d )= ^2 9(M<P P {zi) H h ^d(Pp(xd))- 

It is in general not easy to determine g and ipi, . . . , f2d+i given such a function /. A multilinear decomposition 
of the form ( fT3| > alleviates this by allowing g to be the simplest multivariate function, namely, the product function, 

g(ti,...,t d ) = ti_t 2 ■■■U, (17) 
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and unlike the univariate g in Theorem [3j the g in ( fT7] > works universally for any function / — only the cp p 's 
need to be constructed. Furthermore, ( fT3| ) applies more generally to functions on a product of general domains 
X\ , . . . , Xd whereas Theorem [5] only applies if they are compact intervals of R. 
At this stage, it would be instructive to give a few examples for concreteness. 

Example 4 (Singular value decomposition). Let A £ C mx " be a matrix of rank r. Then it can be decomposed in 
infinitely many ways into a sum of rank-1 terms as 

r 

A = J2 a P u P v *p (18) 

where u p £ C m and v p £ C n are unit-norm vectors and a\ > • • • > oy > 0. Note that if we regard A as a 
complex-valued function on its row and column indices i and j as described earlier in Section [77] then ( fT8| l may 
be written as 

r 

P =i 

which clearly is the same as (|9j. The singular value decomposition (SVD) of A yields one such decomposition, 
where {ui , . . . , u r } and {vi , . . . , v r } are both orthonormal. But in general a rank- revealing decomposition of the 
form < | 1 3 1 > will not have such a property. 

Example 5 (Schmidt decomposition). The previous example can be generalized to infinite dimensions. Let A : 
Hi — > H2 be a compact operator ( also known as a completely continuous operator) between two separable Hilbert 
spaces. Then the Schmidt decomposition theorem says that there exist orthonormal basis {tp p £ H2 : p £ N} and 
{ip p £ Hi : p £ N} so that 

00 

^/ = £op(^P./)ft. (19) 
P =i 

for every f £ Hi. In tensor product notation, ( |19| l becomes 

00 

A = ^°p 1 Pp®'>P* p - 
p=i 

where t/j* denotes the dual form of ip p . 

Examples [4] and [5] are well-known but they are bipartite examples, i.e. d = 2 in ( [13) . This article is primarily 
concerned with the d-partite case where d > 3, which has received far less attention. As we have alluded to in the 
previous section, the identification techniques in this article will rely crucially on the fact that d > 3. 

Example 6. Let A £ C ixmx ™ £, e a 3-dimensional hypermatrix. The outer product of three vectors u £ C , v £ C"\ 
w £ C" is defined by 

u (g) v (g> w = [UiVjWkJij^ £ C lxmxn . 
The rank of A is defined to be the minimum r £ N such that A can be written in the form 

r 

A = ^2 a p u p <g> v p (g) w p , (20) 
p=i 
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and if A = 0, then its rank is set to be 0. This agrees of course with our use of the word rank in ( |10| >, the only 
difference is notational, since ( |20| ) may be written in the form 

r 

a(i>3)k) = ^ (TpUp(i)vp(j)wp(k). 
P =i 

This definition of rank is invariant under the natural action ofGL(l) x GL(m) x GL(n) on C' xmxn 4221 Lemma 2.3], 
i.e. for any X € GL(l),Y € GL(m), Z € GL(n), 

rank((X, F, Z) ■ A) = rank(A). (21) 

77ie definition also extends easily to d-dimensional hypermatrices in C niX '" xn<J one/ when d = 2 reduces to the 
usual definition in Example ^\for matrix rank. This definition is due to F. L. Hitchcock H36V and is often called 
tensor rank. The only difference here is that our observation in Theorem [2] allows us to impose the conditions 

o~\ > 02 > • • • > ay 

and 

II u pII = II v pII = II w pII = 1, P =l,--->r, (22) 

while leaving rank(A) unchanged, thus bringing ( |20[ ) closer in form to its matrix cousin ( |18[ ). W/iflf is iosf /zere is 
that the sets {ui, . . . , u r }, {vi, . . . , v r }, {wi, . . . , w r } can no longer be chosen to be orthonormal as in Example^ 
the unit norm condition ( |22[ ) is as far as we may go. In fact for a generic A £ C ( we will always have 

r > max(7, m, n), 

and {ui, . . . , u r }, {vi, . . . , v r }, {wi, . . . , w r } will be overcomplete sets in C l , C m , C" respectively. 

Perhaps it is worthwhile saying a word concerning our use of the words 'tensor' and 'hypermatrix' : A d-tensor 
or order-d tensor is an element of a tensor product of d vector spaces Vi (S> • • • ® V<j; a d-dimensional hypermatrix 
is an element of C™ lX "' xnd . If we choose bases on Vi, . . . ,Vd, then any d-tensor A e Vi ® • • • ® will have a 
unique coordinate representation as a d-dimensional hypermatrix A e C" lX "' xnj , where = dim(Vfc). A notion 
defined on a hypermatrix is only defined on the tensor (that is represented in coordinates by the hypermatrix) if that 
notion is independent of the choice of bases. So the use of the word 'tensor rank' is in fact well justified because 
of ( pT| . For more details, we refer the reader to [45 1. 

IV. Uniqueness of multilinear decompositions 

In Theorem [2] we chose the coefficients to be in descending order of magnitude and require the factors in each 
separable term to be of unit norm. This is largely to ensure as much uniqueness in the multilinear decomposition as 
generally possible. However there remain two obvious ways to obtain trivially different multilinear decompositions: 

(i) one may scale the factors cpi p , . . . , ip^p by arbitrary unimodulus complex numbers as long as their product is 1; 

(ii) when two or more successive coefficients are equal, their orders in the sum may be arbitrarily permuted. We 
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will call a multilinear decomposition of / that meets the conditions in Theorem [2] essentially unique if the only 
other such decompositions of / differ in one or both of these manners. 

It is perhaps astonishing that when d > 2, a sufficient condition for essential uniqueness can be derived with 
relatively mild conditions on the factors. This relies on the notion of Rruskal rank, which we will now define. 

Definition 7. Let $ = {ip±, . . . , ip r } be a finite collection of vectors of unit norm in l?(X\ x • • • x Xd). The 
Kruskal rank of denoted krank $, is the largest k € N so that every k-element subset of <E> contains linearly 
independent elements. 

This notion was originally introduced in l40l . It is related to the notion of spark introduced in compressed 
sensing (26), 113211 . defined as the smallest k E N so that there is at least one fc-element subset of <£> that is linearly 
dependent. The relation is simple to describe, spark $ = krank $ + 1, and it follows immediately from the respective 
definitions. It is clear that dim span $ > krank $. 

We now generalize Kruskal's famous result |40|, |59| to tensor products of arbitrary Hilbert spaces, possibly of 
infinite dimensions. But first let us be more specific about essential uniqueness. 

Definition 8. We shall say that a multilinear decomposition of the form < | 1 3 1 > (satisfying both ( |16| l and jl5\ j is 

essentially unique if given another such decomposition, 

r r 

Vptpip ® ■ ■ ■ ® <fd P = f = ^2 ApVip ® • • • ® t/jdp, 
p=i p=i 

we must have (i) the coefficients a p = X p for all p = 1, . . . , r; and (ii) the factors <pi p , ■ ■ ■ , fdp and ipipi ■ ■ ■ > ^Pdp 
differ at most via unimodulus scaling, i.e. 

(flip = e i9l ^ lp , ...,<fd P = e i0dp ij dp (23) 

where 0\p + ■ • • + Qd p = 0mod27r, for all p = 1, . . . , r. In the event when successive coefficients are equal, 
<7 p -i > o~ p = cr p+ i = ••• = o- p+q > (Jp +9+ i, the uniqueness of the factors in (ii) is only up to relabelling of 
indices, i.e. p, . . . ,p + q. 

Lemma 9. Let f € L 2 (Xi x • • • x Xd). Then a multilinear decomposition of the form 

r 

/ = ^ o- p <pi p ®---®Wd P (24) 
P =i 

is both essentially unique and rank -revealing, i.e. r — rank /, the following condition is satisfied: 

d 

2r + d- 1 < krank $ fe , (25) 
fe=i 

where $ fc = {ip kl , tp kr } for k = 1, . . . , d. 

Proof: Consider the subspaces Vfc = span(<£fei, . . . , ipkr) m L 2 (X k ) for each k = l,...,d. Observe that 
/ G Vi ® • • • ® Vd. Clearly dim(Vfc) < r and so dim(Vi ® • • • ® Yd) < r d . Now if we could apply Kruskal's 
result [40 1 to the finite-dimensional space Vi <8> • • • <8> Yd, then we may immediately deduce both the uniqueness 
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and rank-revealing property of ( |24] >. However there is one caveat: We need to show that Kruskal rank does not 
change under restriction to a subspace, i.e. the value of krank{<^fei, . . . , ifikr} in ( p5) l is the same whether we regard 
(fki, ■ ■ ■ , <fikr as elements of L 2 (Xk) or as elements of the subspace Vfc. But this just follows from the simple fact 
that linear independence has precisely this property, i.e. if V\, . . . , v n € U C V are linearly independent in the 
vector space V, then then they are linearly independent in the subspace ILL ■ 
It follows immediately why we usually need d > 3 for identifiability. 

Corollary 10. A necessary condition for Kruskal 's inequality ( |25[ ) to hold is that d > 3. 

Proof: If d = 2, then 2r + d — 1 = 2r + 1 > krank$i + krank<i>2 since the Kruskal rank of of r vectors 
cannot exceed r. Likewise for d = 1. ■ 
Lemma [9] shows that the condition in ( |25] l is sufficient to ensure uniqueness and it is known that the condition 
is not necessary. In an appropriate sense, the condition is sharp [24|. We should note that the version of Lemma [9] 
that we state here for general d > 3 is due to Sidiropoulos and Bro l59l . Kruskal's original version l40l is only 
for d = 3. 

The main problem with Lemma [9] is that the condition |25) is difficult to check since the right-hand side cannot 
be readily computed. In fact Kruskal rank is known to be NP-complete over a field of two elements |70l . We 
conjecture that it is NP-hard over K and C. 

Kruskal's result also does not tell us how often are multilinear decompositions unique. In the event when the sets 
Xi,. . . ,X d are finite, L 2 {Xi x • • • x X d ) = C" lX ' " xn<1 where ni = #Xl, . . . , n d — #X d , and there is a simple 
result on uniqueness based simply on a dimension count. Note that the dimension of 1?(X\ x • • • x X d ) is the 
product m • • • n d and the number of parameters needed to describe a separable element of the form \ipi <£> ■ ■ ■ <g> ip d 
where ipi, . . . ,ip d are of unit norm is ni + ■ ■ ■ + n d — d + 1 (each tp^ requires — 1 parameters because of the 
unit norm constraint, the last '+1' accounts for the coefficient A). We call the number 



the expected rank of L 2 (X\ x • • • x X d ), since it is heuristically the minimum r expected for a multilinear 
decomposition ( fT3| l. 



Proposition 11. Let the notations be as above. If f G L 2 {X\ x • • ■ x X d ) has rank smaller than the expected rank, 
i. e. 

rank(J) < 



1 - d + Yl=i n k 

then f admits at most a finite number of distinct rank revealing decompositions. 

This proposition has been proved in several cases, including symmetric tensors lfl4l . but the proof still remains 
incomplete for tensors of most general form lfl3l . HI. 
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V. Estimation of multilinear decompositions 

In practice we would only have at our disposal /, a measurement of / corrupted by noise. Recall that our model 
for / takes the form 

r d 

/(xi, . . . , Xj) = Op ]~[ yfcp(xfc)- (26) 
p =i k=i 

Then we would often have to solve an approximation problem corresponding to (|2o*)> of the form 



argmm 

a££'~ , \\<Php\\=l 



(27) 



p=i k=\ 

which we will call a best rank-r approximation problem. A solution to ( |27"| ), if exists, will be called a best rank-r 
approximation of /. 

We will give some motivations as to why such an approximation is reasonable. Assuming that the norm in ( [27] ) 
is the i 2 -norm and that the factors tpk p , p = 1, . . . r and fc = 1, . . . d, have been determined in advance and we are 
just trying to estimate the parameters ai,...,a r from . . . , f( N > a finite sample of size N of measurements 
of / corrupted by noise, then the solution of the approximation problem in |27} is in fact (i) a maximum likelihood 
estimator (MLE) if the noise is zero mean Gaussian, and (ii) a best linear unbiased estimator (BLUE) if the noise 
has zero mean and finite variance. Of course in our identification problems, the factors ipkp's are not known and 
have to be estimated too. A probabilistic model in this situation would take us too far afield. Note that even for 
the case d = 2 and where the domain of / X\ x X-i is a finite set, a case that essentially reduces to principal 
components analysis (PCA), a probabilistic model along the lines of 11671 requires several strong assumptions and 
was only developed as late as 1999. The lack of a formal probabilistic model has not stopped PCA, proposed in 
1901 [ 56 1, to be an invaluable tool in the intervening century. 

VI. Existence of best multilinear approximation 

As we mentioned in the previous section, in realistic situation where measurements are corrupted by additive 
noise, one has to extract the factors </Jfc p 's and a p through solving an approximation problem ( |27) , that we now 
write in a slightly different (but equivalent) form, 



argmm 

a£[0,oo) r , ||yjfcp|| = l 



r d 

f - a p n 



(28) 



Note that by Theorem [2] we may assume that the coefficients a = (ax, ■ ■ ■ ,a r ) are real and nonnegative valued 
without any loss of generality. Such a form is also natural in applications given that a p usually captures the 
magnitude of whatever quantity that is represented by the p summand. 

We will see this problem, whether in the form ( |27] i or ( |28) , has no solution in general. We will first observe a 
somewhat unusual phenomenon in multilinear decomposition of d-partite functions where d > 3, namely, a sequence 
of rank-r functions (each with an rank-r multilinear decomposition) can converge to a limit that is not rank-r (has 
no rank-r multilinear decomposition). 
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Example 12 (Multilinear approximation of functions). For linearly independent ipi,ipi : X\ — > C, <P2,tp2 '■ X 2 
C, ip 3 , ip 3 :X 3 ^ C, let f : Xi x X 2 x X 3 -+ C foe 



/(xi,x 2 ,x 3 ) := 7/>l(Xi)^2(x2V3(x 3 ) 
For n£N, define 

/n(xi,X2,X3) : = 

n ipi (xi) + -Vl(xi) 
n 

/(X1,X 2 ,X 3 ) - / n (xi,X 2 ,X 3 ) = -[■0l(xi)-0 2 (x 2 )v33(x 3 ) 



+ ( / 9i(xi)l/' 2 (x 2 )^3(x3) + (/3i(xi)(p 2 (x 2 )'03(x 3 ). 



V2(X2) + -1/>2(X2) 



V3(X3) + -^3(X3) 



■ ri(<9l(xi)</J2(x2)l/?3(x3). 



+ ^l(Xl)v 2 (x 2 )^ 3 + ( / 3 1 (X 1 )V' 2 (X 2 )V' 3 (X 3 )]. 



Hence 



I/- /n|| =0 



(29) 



Lemma 13. /« Example 12 rank(/) = 3 iff y>i,ipi are linearly independent, i — 1,2,3. Furthermore, it is clear 
that rank(/„) < 2 and 

lim /„ = /. 

n— >oo 

Note that our fundamental approximation problem may be regarded as the approixmation problem 

argmin{||/ - /|| : rank(/) < r}, (30) 

followed by a decomposition problem 

r d 

i = ^2 a P n fk P , 

p=l k=l 

which always exists for an / with rank(/) < r. The discussion above shows that there are / for which 

argmin{||/ — /|| : rank(/) < r} = 0, 



and thus ( |28j > or ( |30| l does not need to have a solution in general. This is such a crucial point that we are obliged 
to formally state it. 

Theorem 14. For d > 3, the best approximation of a d-partite function by a sum of p products of d separable 
functions does not exist in general. 
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Proof: Take the tripartite function / 6 L 2 (Xi x X 2 x X3) in Example 12 Suppose we seek a best rank-2 
approximation, in other words, we seek to solve the minimization problem 



Now the infimum, 



argmin ||/ - 7^ ® g 2 ® g 3 - r\h\ ® /i 2 ® fts| 



inf ||/-7fl'i®52®ff3-^i ( 8/i2®^3|| =0 



since we may choose n E N sufficiently large, 



(fik+n Vfc , Vk 
9k = Ti , ZT1 11 ' "fc ~ 



ll^fc + n Vfcll ' ll^fcll ' 

for fc = 1,2,3, 

7 = n\\ipi + n _ Vi||||¥>2 + n _ V'2||||^3 + ^"Vsll, 
77 = n||pi||||(p2||||<P3||, 

so as make ||/ — jgi <E) g 2 ® 53 — ® ^2 ® ^3 II as small as we desired by virtue of ( |29] i. However there is no 
rank-2 function 731 <g> #2 <8 <?3 — 77/11 <8> /12 <8> ^3 for which 

||/-79i®fl , 2®fl'3-»?fc'i®fr2®M =0. 

In other words, the zero infimum can never be attained. ■ 
Our construction above is based on an earlier construction in [22 1 . The first such example was given in fll , 
which also contains the very first definiton of border rank. We will define it here for d-partite functions. When 
X%,... ,Xd are finite sets, this reduces to the original definiton in (4) for hypermatrices. 

Definition 15. Let / G L 2 (X\ x ■ ■ • x X$). The border rank of f is defined as 



rank(/) = min{r e N : inf ||/ - g\\ = 

over all g with rank(<?) = r}. 

Clearly we would always have that 



rank(/) < rank(/). 



The discussions above show that strict inequality can occur. In fact, for the / in Example 12 rank(/) = 2 while 
rank(j) = 3. 

We would like to mention here that this problem applies to operators too. Approximation of an operator by a sum 
of tensor/Kronecker products of lower-dimensional operators is in general an ill-posed problem whose existence 
cannot be guaranteed. 

Example 16 (Multilinear approximation of operators). For linearly independent operators $>i,*&i : Vj, — > Wi, 
i = 1,2, 3, let f : Vi ® V 2 <g> V s -> W x ® W 2 ® W 3 be 

f := \I>i <g> $2 ® $3 + $1 (g) $2 ® *3 + $1 <E> $2 (8 *3- (31) 
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If ^i's are all finite-dimensional and represented in coordinates as matrices, then '®' may be taken to be 
Kronecker product of matrices. For n € N, 



T„ := n 















n 


$1 + 












n 




n 




n 



Then 



lim T„ = T. 



An example of an operator that has the form in \2>\\ is the "Sin- dimensional Laplacian which can be expressed 

in terms of the m-dimensional Laplacian A m as 

^3m — A m <g> I ® I + I <8> A m ® I + J ® I ® A m . 
There are several simple but artificial ways to alleviate the issue of non-existent best approximant. Observe from 



the proof of Theorem 14 that the coefficients in the approximant 7, 77 becomes unbounded in the limit. Likewise 



we see this happening in Example 16 In fact this must always happen — in the event when a function or operator 



is approximated by a rank-r function, i.e. 



p=i k=i 



or 



1$ 



kp 



P =i 



k=l 



(32) 



and if a best approximation does not exist, then the r coefficients oci, . . . , a r must all diverge in magnitude to +00 
as the approximant converges to the infimum of the norm loss function in ( [32| ). This result was first established in 
11221 Proposition 4.9]. 

So a simple but artificial way to prevent the nonexistence issue is to simply limit the sizes of the coefficients 
ai, . . . , a r in the approximant. One way to achieve this is regularization ll55l . B6l — adding a regularization term 
to our objective function in |28) to penalize large coefficients. A common choice is Tychonoff regularization, which 
uses a sum-of-squares regularization term: 



argmm 

*e[o,oo) r , \\ip kp 



p=i 



(33) 



P =i k=i 

Here A is an arbitrarily chosen regularization parameter. It can be seen that this is equivalent to constraining the 
sizes a\, . . . , a r to X)p=il a p| 2 = P< p being determined a posteriori from A. The main drawback of such 
constraints is that p and A are arbitrary, and that they generally have no physical meaning. 

More generally, one may alleviate the nonexistence issue by restricting the optimization problem ([30} to a compact 
subset of its non-compact feasible set 



Limiting the sizes of a±, 
In ifTTl . the factors (p^i, ■ 



{f&L 2 {X x x • • • x X d ) : rank(/) < r}. 

. , a r is a special case but there are several other simple (but also artificial) strategies. 
, ifikp are required to be orthogonal for all fee {1, . . . , d}, i.e. 

((Pkp, ( Pkq)k = <W P)?=l>--->r> k=l,...,d. (34) 
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This remedy is acceptable only in very restrictive conditions. In fact a necessary condition for this to work is that 

r< min dim L 2 (Xk). 
k=l,...,d 

It is also trivial to see that imposing orthogonality between the separable factors removes this problem 

(ipip ® • • • ® ifdp,fi q ® • • • ® <^rfg) = <5 P9 , P, 9 = 1, ■ ■ • ,r. (35) 

This constraint is slightly less restrictive — by ( fT2| ), it is equivalent to requiring ([34) for some k G {1, . . . ,d}. 
Both ([34) and ( [35) are nonetheless so restrictive as to exclude the most useful circumstances for the model (13) , 
which usually involves factors that have no reason to be orthogonal, as we will see in Section [IX] In fact, Kruskal's 
uniqueness condition is such a potent tool precisely because it does not require orthogonality. 

The conditions p4) , ( [35) , and ( p3) all limit the feasible sets for the original approximation ( |28) to a much smaller 
compact subset of the original feasible set. This is not the case for nonnegative constraints. In 11461 it was shown 
that the following best rank-r approximation problem for a nonnegative-valued / and where the coefficients a p and 
factors tfkp of the approximants are also nonnegative valued, i.e. 



argmm 

*6[0,oo) r , ||¥>fc p ||=l, <p kp >0 



p=l k=l 

always has a solution. The feasible set in this case is non-compact and has nonempty interior within the feasible set 
of our original problem (|28). The nonnegativity constraints are natural in some applications, such as the fluorescence 



spectroscopy one described in Section IX-F where cp^p represent intensities and concentrations, and are therefore 
nonnegative valued. 

There are two major problems with imposing artificial constraints simply to force a solution: How do we know 
a priori that the solution that we seek would meet those constraints? But more importantly, perhaps the model is 
ill-posed and a solution indeed should not exist? To illustrate the case in point with a more commonplace example, 
suppose we want to find a maximum likelihood estimator X 6 R nxn for the covariance E of independent samples 
yi, . . . , y m ~ N(0, E). This would lead us to a semi-definite programming problem 

argmin tr(X" 1 r) - log det(X) (36) 

where Y = ^ Y^iLi YiYi ■ However the problem will not have a solution when the number of samples is smaller 
than the dimension, i.e. m < n, as the infimum of the loss function in ( |3"6) cannot be attained by any X in the 
feasible set. This is an indication that we should seek more samples (so that we could get m > n, which will 
guarantee the attainment of the infimum) or use a different model (e.g. determine if X^ 1 might perhaps have 
some a priori zero entries due to statistical independence of the variables). It is usually unwise to impose artificial 
constraints on the covariance matrix X just so that the loss function in ([36) would attain an infimum on a smaller 
feasible set — the thereby obtained 'solution' may bear no relation to the true solution that we want. 



Our goal in Section VIII-A is to define a type of physically meaningful constraints via the notion of coherence. 
It ensures the existence of a unique minimum, but not via an artificial limitation of the optimization problem to 
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a convenient subset of the feasible set. In the applications we discuss in Section IX we will see that it is natural 



to expect existence of a solution when coherence is small enough, but not otherwise. So when our model is ill- 
posed or ill-conditioned, we are warned by the size of the coherence and could seek other remedies (collect more 
measurements, use a different model, etc) instead of forcing a 'solution' that bears no relation to reality. But before 
we get to that we will examine another method based on an approximation of rank by a ratio of Schatten 1- and 
oo-norms. 



VII. Schatten and Ky Fan norms 
We introduce the notion of Schatten norms and Ky Fan norms for multipartite functions and see how a condition 



involving the Schatten 1-norm and Schatten oo-norm allows us to alleviate the problem discussed in Section VI 
namely, that a d-partite function may not have a best approximation by a sum of r separable functions. 

The definition of Schatten norm follows naturally from the definition of rank in Section [In] and from which the 
definiton of Ky Fan norm is immediate. 

Definition 17. For any 1 < s < oo, we define the Schatten s-norm of f € l?(X\ x • ■ ■ x X&) as 



l/lk 



inf< 



1/8 



P =i 



Ikfepll = 1, A p > A p+1 > j (37) 

with the usual modification (replace sum by supremum) for the case s = oo. 

Definition 18. For any 1 < s < oo and any k £ N, we define the Ky Fan (s, k)-norm of f £ L?(X\ x • ■ ■ x Xj) 



as 



l/ll 



inf < 



' k 



1/3 



■ f = 22 Ap^ip <8> • • • <8> (fd P , 
P =i 



hk P \\ = 1, Ap > A p+1 > } (38) 



with again the usual modification for the case s = oo. 



The letters s and k are of course chosen to remind us of the respective eponymous norms. The fact that ( |3"7j ) 
and ([38| define norms on L?{X\ x • • • x Xj) follows from the standard Minkowski gauge argument [23|. The 
Ky Fan and Schatten norms are related by ||/||* s oo = 11/11* s- When X\, . . . ,Xd are finite sets of cardinalities 
ni, ...,n<j G N, the Schatten 1-norm for the unipartite case (d = 1) is just the usual ^ 1 -norm for vectors in 
C" 1 = L 2 (Xi); the Schatten s-norm for the bipartite case (d = 2) agrees with the usual Schatten s-norm for 
matrices in C™ 1 *™ 2 = L 2 (Xi x X 2 )- In particular, the bipartite Schatten 1, 2, and oo-norms are respectively the 



nuclear, Frobenius, and spectral norms of a matrix. For d > 3, Definitions 17 and 18 yield notions of Schatten and 
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Ky Fan norms for hypermatrices in C niX '" x ™ <i = L 2 (Xi x • • • x Xj) when X% are finite sets (with elements) 
for all k — 1 , . . . , d. 



Example 19 (Nuclear norm for 3-tensors). Let T £ C niX " 2Xn3 . Then by Definition 17 we have 



||T||*,i = inf i X p : T = X p*p ® y P z p f ' 

where the infimum is taken over all linear combinations of complex vectors of unit 2-norm x p £ C™ 1 , y p £ C" 2 , 
z p € C™ 3 , with real positive coefficientss X p £ [0, oo), and p = 1, . . . , r, with r £ N. 



Note that we used the term tensors, as opposed to hypermatrices, in the above example. In fact, Definitions 17 
and [18] define Schatten and Ky Fan norms for the tensors, not just their coordinate representations as hypermatrices 
(see our discussion after Example [6]), because of the following invariant properties. 

Lemma 20. The Schatten and Ky Fan norms as defined in ( |37| ) and ( |38| > for C niX '" xn<J are unitarily invariant, 
i.e. invariant under the natural action ofU(n±) x • • • x U{n^) where U(n) denotes the group of unitary matrices 



Proof: To avoid the clutter of indices, we will assume that d = 3. It is easy, although notationally cumbersome, 
to extend this to general d > 3. Let (U, V, W) G U(n{) x U{n 2 ) x U(n 3 ) and T € C" lX ™ 2X ™ 3 . The natural action, 
given in coordinates by 

(U,V,W)-T = 



^ UaiVbjWckUjk 
i,j,k=l 

has the property that if T has a multilinear decomposition of the form 

r 

t = 22 \ x p ® yp ® z i" 

then 



111,112,13 



,fc,C=l 



(39) 



(CT, V, W) • T = ^ A p (C/x p ) ® (Vy p ) ® (Wz„), 
p=i 

( |39l > is obvious when r = 1 and for general r follows from the linearity of the action (i.e. (U, V, W) ■ (S + T) = 
(U,V,W) ■ S + (U,V,W) ■ T). We also need the simple fact that U{n±) x U{n 2 ) x U(n 3 ) acts transitively on 
unit-norm rank-1 tensors, i.e. take any x £ C™ 1 , y £ C™ 2 , z £ C™ 3 of unit norm, then every other unit-norm rank-1 
tensor may be written as Ux (8 Vy ® for some ([/, V, W) € U(ni) x U{ri2) x U{n 3 ). With these, it follows 



immediately from Definition 17 that Schatten norms satisfy 

||(C/,y,W)-T|U i5 = ||T|| M 

and likewise for Ky Fan norms. 
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If we allow < s < 1, then ( |37| > and ( |38] l no longer define norms although the Schatten and Ky Fan quasinorms 

{k oo 
X! A* : / = ^2 KVip ® ■ ■ ■ ® 'Pdp, 
p=i p=i 

HVfcpll = 1, A p > A p+ i > o| 

are nonetheless still interesting measures of nearness. Observe that we have dropped the l/s power here. In particular, 
the limiting case as s — > yields the Schatten Q-quasinorm, 

{oo oo 
x p : f = x pvip ® • ■ • ® (fd P , 
p=l p=l 

\\Vkp\\ = 1, A p > Ap+i > o|, 

where we adopt the convention that 0° = 0. 

Lemma 21. For any / G L 2 (Xi x ■ ■ • x J>Q), we nave ||/||* o = rank(/) and 

||/lki < rank(/) x ||/||, i00 . (40) 

/n of/zer words, the Schatten 1-norm is a convex underestimator of rank on the Schatten co-norm unit ball {/ : 
||/||* oo < !}• In fact, (|40|l may fee sharpened to have border rank in place of rank 



||/lki <rank(/) x ||/||, i0t) . (41) 

Proof: Let rank(/) = r and / = X)p=i ^p'Pip ® • • • ® <^<jp be a multilinear decomposition as in Theorem |5j 
Therefore ||y?ip<8>- • ■<E)(fidp\\*,i = \\<Pip\\ ■ ■ • IkdpH = 1 for allp = 1, . . . ,r. The triangle inequality immediate yields 



11/11*, i ^ Sp=i|Ap||kip (8 • • • (8 fd P \\*,i = Z)p=i|A P | < '"ll/ll*, oo- For the border rank version, let rank(/) = r, 
then there exists a sequence /„ where rank(/ n ) = r and lirrijj^oo /„ = /. By what we have just proved, 

||/n||*,i < rank(/„) x ||/ n ||* >0O = r\\ fn 1 1 * ,oo • 

Taking limits and using the continuity of norms immediately yields ( |4T| >. ■ 
It is known that the ^ 1 -norm is the largest convex underestimate^] of the 0-quasinorm on the £°°-norm unit ball 
P31 and that the nuclear norm is the largest convex underestimator of rank on spectral norm unit ball ||29ll . We 
suspect that a vast generalization of this observation is true, namely, the Schatten 1-norm as defined in ([37) is the 
largest convex underestimator of the rank function as defined in (flU). We are however unable to prove nor disprove 



this stronger version of Lemma 21 i.e., with 'largest convex underestimator' in place of 'convex underestimator'. 



The condition d40l in Lemma |2T| provides a simple way for alleviating the fact that by replacing the condition 



rank(/) < r by the condition ||/||*,i < r||/||* i00 . Recall that the discussion in Section VI shows that there are / 



Also called the greatest convex minorant, in this case also equivalent to the Legendre-Frenchel biconjugate or convex biconjugate. 
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for which 

argmin{||/ - /|| : rank(/) < r} = 0, 

which really results from the fact that 

{f E L 2 (X X x ■ ■ ■ x X d ) : mnk(f) < r} 



is not a closed set. The condition ( |4Q| i we derived may then be used as a work-around relaxation where we have 
the ratio ||/||*,i/||/||*,oo as a 'proxy' in place of rank(/). 

Theorem 22. Let f e L 2 {X\ x ■ ■ ■ x Xd). For any rgN, the optimization problem 

argmin{||/-/|| : < r||/||, i00 } 

always has a solution. 

Proof: The follows from the fact that the set 

{feL 2 (X 1 x---xX d ): H/IU.1 < r||/|U l00 } 

is closed, which follows easily from the continuity of the real-valued function || • ||„ i — r\\ ■ \\* l0 c on L 2 (X\ x ■ ■ ■ x 
Xj), which in turn follows easily from the fact that all norms are continuous functions. ■ 
We would like to add a few words about the ratio 



\\J II*, oo 

Note that this gives a crude notion of 'numerical rank' for a <i-partite function. Recall that when d = 2 and 
X\,Xi are finite, bipartite functions may be regarded as matrices, i.e. L 2 {X\ x X<j) = C" lX ™ 2 where #X\ — m, 
4fX 2 = n 2 . In this case, the ratio of the Frobenius norm to the spectral norm and the ratio of the nuclear norm to 
the spectral norm are occasionally used as proxies for matrix rank, often in scenarios where the use of matrix rank 
would be computational intractable. These ratios are used because of bounds that are the analogues of d40|, 



\\A\\ F < A/rank^) || A|| 2 and < rank(A)||A|| 



for A g C" lX ™ 2 . In fact the inequality on the right is exactly ( |40| > applied to the case of bipartite functions, i.e. 
matrices. 



Theorem 22 represents a simple, elegant solution to the non-existence problem in Section VI but we do not find 
it useful for the applications that we consider here. Instead another workaround that uses the notion of coherence, 
discussed in the next section, is more naturally applicable in our situtations. 

VIII. Coherence 

We will show in this section that a simple measure of angular constriant called coherence, or rather, the closely 
related notion of relative incoherence, allows us to alleviate two problems simultaneously: the computational 
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intractability of checking for uniqueness discussed in Section IV and the non-existence of a best approximant 
in Section IVTl 

Definition 23. Let W be a Hilbert space provided with scalar product (•, ■), and let <J> C H be a set of elements of 
unit norm in EL The coherence of $ is defined as 

//($) = sup \{<p,ip)\ 

where the supremum is taken over all distinct pairs (p,if> G H. If $ = ■ ■ . , <p r } is finite, we also write 
tt(<Pi, ■■-,'Pr) ■= max p ^ 9 |((^p,^ 9 )|. 



We adopt the convention that whenever we write /i(3>) (resp. fx((pi, . . . , tp r )) as in Definition 23 it is implicitly 
implied that all elements of $ (resp. <px,..., <p r ) are of unit norm. 

Th notion of coherence has received different names in the literature: mutual incoherence of two dictionaries 
11261 . mutual coherence of two dictionaries [9], the coherence of a subspace projection (8), etc. The version here 
follows that of |32|. Usually, dictionaries are finite or countable, but we have here a continuum of atoms. Clearly, 
< /i( ( E > ) < 1, and /i(<J>) = iff <px, . . . , <p r are orthonormal. Also, — 1 iff $ contains at least a pair of 

collinear elements, i.e. <p p = Xip q for some p ^ q, A ^ 0. 

We find it useful to introduce a closely related notion that we call relative incoherence. It allows us to formulate 
some of our results slightly more elegantly. 

Definition 24. Let <I> C H be a set of elements of unit norm. The relative incoherence of $ is defined as 

For a finite set of unit vectors $ = {(p±, . . . , <p r }, we will also write u(ipi, . . . , tp r ) occasionally. 

It follows from our observation about coherence that < w($) < oo, w($) = cx) iff (pi, . . . , tp r are orthonormal, 
and w($) = iff $ contains at least a pair of collinear elements. 

In the next few subsections, we will see respectively how coherence can inform us about the existence (Sec- 
tion VIII-A i, uniqueness (Section VIII-B I, as well as both existence and uniqueness (Section VIII-C i of a solution 



to the best rank-r multilinear approximation problem (|28|. We will also see how it can be used for establishing 



exact recoverability (Section VIII-Di and approximation bounds (Section VIII-Ei in greedy algorithms. 



A. Existence via coherence 

The goal is to prevent the phenomenon we observed in Example [12] to occur, by imposing natural and weak 
constraints; we do not want to reduce the search to a compact set. It is clear that the objective is not coercive, 
which explains why the minimum may not exist. But with an additional condition on the coherence, we shall be 
able to prove existence thanks to coercivity. 

The following shows that a solution to the bounded coherence best rank-r approximation problem always exists: 
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Proposition 25. Let f e L 2 (Xi x • • • x X d ) be a d-partite function. If 



k=l 



or equivalently if 



then 



f[(l + u k ) >r-l 
i 

d 1 



(42) 



(43) 



fc=i 



inf 



p=l 



A € C, Cj(<^fcl,. . . , Ifkr) > Wfe 



(44) 



is attained. Here || • || denotes the L 2 -norm on I?{X\ x ■ ■ • x Xj) and A = (Ai, . . . , A,). If desired, we may assume 
that AgM r and Ai > • • • > A r > by Theorem [2] 



Proof: The equivalence between ( |42| > and ( |43| > follows from Definition 24 We show that if either of these 
conditions are met, then the loss function is coercive. We have the following inequalities 

2 w 



Ap^ip <& • • • <g> 

p=l 



P,<J=1 fc=l 



> ApAp FJ ||^ fe p| 
p=i fe=i 



E 



ApA g J^[(^fcp,^fc ? ) 



fe=i 



p=l k=l p^q 

>||A||»-(r-l)||A||in^ 

k=l 



where the last inequality follows from 



]T|ApA 9 | = 2^|ApA 9 | < £(|A P | 2 + |A g | 2 ) = (r - 1)||A| 

p#<3 P<? P<9 



This yields 



^ Apl^lp ® • • • (g) (y9 d p 

p=l 



> 



l-(r-l)TT/* 

k=l 



|A||1 



(45) 
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Since by assumption (r — 1) rifc=i Mfe < 1» it i s cl ear that the left hand side of |45) tends to infinity as || A|| 



And because / is fixed, 



also tends to infinity as ||A||2 — > oo. This proves coercivity 
of the loss function and hence the existential statement. ■ 
The condition (|42|i or, equivalently, ([43)1, in Proposition 25 is sharp in an appropriate sense. Proposition 25 shows 



that the condition ( |43) is sufficient in the sense that it guarantees a best rank-?' approximation when the condition 
is met. We show that it is also necessary in the sense that if ( |43| > does not hold, then there are examples where a 
best rank-r approximation fails to exist. 

In fact, let / be as in Example 12 As demonstrated in the proof of Theorem 14 the infimum for the case d = 3 
and r = 2, 



inf ||/ - Xgi ® g 2 ® g 3 - \xh\ ® hi ® /13II 

II fffc II = II ^fe 11 = 1-. -Va<>0 



is not attained. Since 



_ ip k + n y fc 
H^fe + n~Vfe|| 

for fc = 1, 2, 3, the corresponding coherence 



hk = 



\<Pk\ 



n(gk,h k ) > \{g k ,h k )\ 



1 



as n — » 00. For any values of /Lti,/X2,A*3 <= [0, 1] such that ( |43| ) holds, i.e. [i\[i 2 [i 3 < l/( r — 1) = L we cannot 
possibly have /i(<7fe, /ife) < [i k for all fc = 1, 2, 3 since 



li(g 1 ,h 1 )^{g 2 ,h 2 )fi(g 3 ,h 3 ) 



1 



as n -> 00. 



2?. Uniqueness via coherence 

In order to prove uniqueness, we need a simple observation and the notion of Kruskal rank introduced in 
Definition [7] 

Lemma 26. Let $ C L 2 (AT! x ■ • • x X d ) be finite. Then 

krank $ > —j— (46) 

Proof: Let s = krank $ + 1. Then there exists a subset of s distinct unit vectors in <£>, {$1, . . . , $ s } such 
that o;i$i + ■ • • + a s & s = with |ai| = max{|ai|, . . . , |a s |} > 0. Taking inner product with $1 we get a\ = 

-a 2 (&2, $1) a s (<I> s , $1) and so |ai| < (|a 2 |H h|a s |)/x(V). Dividing by |ai| then yields 1 < (s-l)^i($). 

■ 

We now characterize the uniqueness of the rank revealing decomposition in terms of coherence introduced in 
Definition |231 
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Proposition 27. Let f € i 2 (X! x • • • x X^) /zave a rank-r multilinear decomposition 

r 

/ = E Ap^ip ® • • • ® 
P =i 

where $^ := {</9fci, . . . , flie elements in L 2 (Xk) of unit norm for all k = 1, . . . ,d. Let u>k — w($fc). If 

d 

^u; fe >2r-l, (47) 
fe=i 

f/ien r = rank(/) and the decomposition is essentially unique. 

Proof: Inequality p7] > implies that J2k=i 1 > 2r + <i — 1, where /i^ denotes If it is satisfied, then 

so is Kruskal's condition ( |25] > thanks to Lemma 26 The result hence directly follows from Lemma [9] ■ 



Note that unlike the fc -ranks in p5) , the coherences in ( |47] > are trivial to compute. In addition to uniqueness, an 
easy but important consequence of Proposition [27] is that it provides a readily checkable sufficient condition for 
tensor rank, which is NP-hard over any field BP . P2l . 

C. Existence and uniqueness via coherence 



Now the following existence and uniqueness sufficient condition can be deduced from Propositions 25 and 27 

Corollary 28. If d> 3 and if coherences \Xk satisfy 

( d \ ^ d 

then the bounded coherence best rank-r approximation problem has a unique solution up to unimodulus scaling. 

Proof: The existence in the case r = 1 is ensured, because the set of separable functions {tpi <g> • • • <g> tp d \ 
ifik € L 2 (Xk)} is closed. Consider thus the case r > 2. Since the function f(x) — ~ — ^x+d-i ) * s stI "i c tly 
positive for x > 2 and d > 3, condition ( [48] ) implies that Ilfe=i Mfe * s smaller than 1/r, which permits to claim that 



the solution exists by calling for Proposition 25 Next in order to prove uniqueness, we use the inequality between 
harmonic and geometric means: if ( |4"8j ) is verified, then we also necessarily have d ^X)fc=i 1 ) < 2r+d-i • 
Hence 53fc=i 1 > 2r + — 1 and we can apply Proposition 



27 



In practice, simpler expressions than ( |48| > can be more attractive for computational purposes. These can be derived 
from the inequalities between means: 

(5 XX 1 ) <(fl^) ^X>^X>*) ■ 

\ fc=i / \fc=i / fe=i \ fe=i / 

Examples of weaker sufficient conditions that could be used in place of ( |4"8j ) include 

d 2 

2r + d-V 



d J2 



fc=i 



4 2r + d - 1 

fe=i 
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Another simplification can be performed, which yields differentiable expressions of the constraints if ( |50] > is to 
be used. In fact, noting that for any set of numbers x%, ...,x n € C, maxi = i.... n |a;i| < V/C^il 2 -*! 2 ' a sufficient 
condition ensuring that d50l is satisfied, and hence (|48)>, is 



' d 



k=l p<q 



2r + d-l 



D. Exact recoverability via coherence 

We now describe a result that follows from the remarkable work of Temlyakov. It allows us to in principle 



determine the multilinear decomposition meeting the type of coherence conditions in Section |VHI-A| 

Some additional notations would be useful. We let $ C {/ £ L 2 (X\ x • ■ • x Xj : rank(/) = 1} be a dictionar^ 
of separable functions (i.e. rank-1) in L 2 (Xi x • ■ ■ x Xj) that meets a bounded coherence condition , i.e. 

M$) < A* (51) 
for some /i £ [0, 1) to be chosen later. Recall that the elements of $ are implictly assumed to be of unit norm 



23 1. Note that in Proposition 25 we had /i = IIfc=i Mfc ^ ut we w °uld not impose this 



(cf. remark after Definition 
here. 

Let t £ (0, 1]. The weakly orthogonal greedy algorithm (WOGA) is simple to describe: Set fo = /. For each 
m £ N, we inductively define a sequence of / m 's as follows: 



1) ffm € $ is any element satisfying 



\{fm-l,9m)\ > *sup|(/ m _i,p)|; 

36-D 



2) h m £ L 2 (Xi x • • • x Xd) is a projection of / onto span(g 1; . . . , g m ), i.e. 

h m £ argmin{||/ - g\\ : g £ span(gi, . . . , g m )}; (52) 

3) f m € L 2 (Xi x • • • x X<i) is a deflation of / by h m , i.e. 

fm — / . 

Note that by Proposition 25 the projection in ( |52| ) is well-defined, i.e. a minimizer h m always exist. The following 



result, adpated here for our purpose, was proved for any arbitrary dictionary in [66]. Also note that deflation generally 
does not work to compute multilinear decompositions, as pointed out in [ 63 ] . 

Theorem 29 (Temlyakov). Suppose f £ l?(X x x • • ■ x X(i) has a multilinear decomposition 

r 

f = Ap^ip ® • • • ® (p d p 
P =i 

with (fip (g> • • • (g) ip dp £ & and the condition that 

1 fi 1 

r < 1 + - 

i + t V n 

4 A dictionary is any set $ C H whose linear span is dense in the Hilbert space 
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for some t € (0, 1] and /i = Jlfc=i Mfe- Then the WOGA algorithm recovers the factors exactly, or more precisely, 
f r = and thus f = h r . 

So h r , by its definition in ( |52| and our choice of $, is given in the form of a linear combination of rank-l 
functions, i.e. an rank-r multilinear decomposition. 



E. Greedy approximation bounds via coherence 



This discussion in Section |VIII-D| pertains to exact recovery of a rank-r multilinear decomposition although our 
main problem really takes the form of a best rank-r approximation more often than not. We will describe some 
greedy approximation bounds for the approximation problem in this section. 

We let 



inf 

a£C r , llvfepl 



p=l k=l 



By our definition of rank and border rank, 

C r r (/) = inf{||/-/|hrank(/)<r} 



min {|l/ - /II : rank(/) < r}. 



It would be wonderful if greedy algorithms along the lines of what we discussed in Section VIII-D could yield an 
approximant within some provable bounds that is a factor of oy(/). However this is too much to hope for mainly 
because a dictionary comprising all separable functions, i.e. {/ : rank(/) = 1} is far too large to be amenable to 
such analysis. This does not prevent us from considering somewhat more restrictive dictionaries like what we did 
in the previous section. So again, let <E> C {/ e I?(X\ x • • ■ x Xd) : rank(J) = 1} be such that 

for some given /i 6 [0,1) to be chosen later. Let us instead define 

Sr(f) 

Clearly 



inf 



f -^2 a p ip p 
P =i 



<Tr(f) < Sr(f) 



(53) 



since the infimum is taken over a smaller dictionary. 



The special case where t = 1 in the WOGA described in Section VIII-D is also called the orthogonal greedy 
algorithm (OGA). The result we state next comes from the work of a number of people done over the last decade: 
(|54j is due to Gilbert, Muthukrisnan, and Strauss in 2003 EH; ([55} is due to Tropp in 2004 |68|; |56| is due to 
Dohono, Elad, and Temlyakov in 2006 1271 : and ( |57] > is due to Livshitz in 2012 11481 . We merely apply it to our 
approximation problem here. 



Theorem 30. Let f € L 2 {X\ x • ■ ■ x Xj) and f r be the rth iterate as defined in WOGA with t — 1 and input f. 
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1) Ifr< i/x 



, then 



\\f-fr\\<%r l ' 2 s r {f). 



(54) 



2) Ifr<y- 



, then 



\\f-fr\\<{l + Zry/ 2 s r {f). 



(55) 



3) Ifr< ^M" 2/3 . then 



||/-/rlogr|| <24a r (/). 



(56) 



4) //r<4/i 



, f/ien 



||/-/2r||<3s r (/). 



(57) 



It would be marvelous if one could instead establish bounds in ( |54) , ([33}, ( |56| ), and ( f57] > with <r r (/) in place of 
s r (/) and {/ : rank(/) = 1} in place of <£>, dropping the coherence fi altogether. In which case one may estimate 



how well the rth OGA iterates f r approximates the best rank-r approximation. This appears to be beyond present 
capabilites. 



Various applications, many under the headings of CANDECOMP |12| and PARAFAC ll35l . have appeared in 
psychometrics and, more recently, also other data analytic applications. We found that many of such applications 
suffer from a regretable defect — there are no compelling reasons nor rigorous arguments that support the use of a 
rank-r multilinear decomposition model. The mere fact that a data set may be cast in the form of a d-dimensional 
array A G C" lX "' x,ld does not mean that ( fT3j ) would be the right or even just a reasonable thing to do. In particular, 
how would one even interpret the factors <pk P 's when d > 2? When d = 2, one could arguably interpret these as 
principal or varimax components when orthonormality is imposed but for general d > 2, a convincing application 
of a model based on the rank-r multilinear decomposition ( fT3| must rely on careful arguments that follow from 
first principles. Other than CANDECOMP [ 12 1 and PARAFAC, the finite-dimensional multilinear decompositions have 
also been studied under the names CP, CAND, canonical decomposition, canonical polyadic decompositions, etc. 

The goal of this section is two-fold. First we would like to provide a selection of applications where the rank-r 
multilinear decomposition ( fT3] > arises naturally via considerations of first principles (in electrodynamics, quantum 
mechanics, wave propagation, etc). Secondly we want to also demonstrate that the coherence conditions discussed 
extensively in Section [VIII| invariably have reasonable interpretations in terms of physical quantities. 

The use of a rank-r multilinear decomposition model in signal processing via the higher-order statistics has a 
long history ||57l . l30l . ifTBI . ifTTI . l58l . Our signal processing applications here are of a different nature, they are 
based on geometrical properties of sensor arrays instead of considerations of higher-order statistics. This line of 
argument first appeared in the work of Sidiropoulos and Bro |60|, which is innovative and well-motivated by first 
principles. However, like all other applications considered thus far, whether in data analysis, signal processing, 
psychometrics, or chemometrics, it does not address the serious nonexistence problem that we discussed at length 



IX. Applications 
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in Section VIII-A Without any guarantee that a solution to ( |28| ) exists, one can never be sure when the model 
would yield a solution. Another issue of concern is that the Kruskal uniqueness condition in Lemma [9] has often 
been invoked to provide evidence of a unique solution but as we have discussed in Section |IV| this condition 
is impossible to check since there is no known way to efficiently computing the Kruskal rank. The applications 



considered below would use the coherence conditions developed in Section VIII to avoid these difficulties. More 



precisely, Proposition 25 Proposition 27 and Corollary 28 are invoked to guarantee the existence of a solution 



to the approximation problem and provide readily checkable conditions for uniqueness of the solution, all via the 
notion of coherence. 

In this section, applications are presented in finite dimension. In order to avoid any confusion, X* , X H and X T 
will denote complex conjugate, hermitian transpose, and transpose, of the matrix X respectively. 

A. Joint channel and source estimation 

Consider a narrow band transmission problem in the far field. We assume here that we are in the context of 
wireless telecommunications, but the same principle could apply in other fields. Let r signals impinge on an array, 
so that their mixture is recorded. It is wished to recover the original signals, and to estimate their directions of 
arrival and respective powers at the receiver. If the channel is specular, some of these signals can correspond to 
different propagation paths of the same radiating source, and are hence correlated. In other words, r does not denote 
the number of sources, but the total number of distinct paths viewed from the receiver. 

In the present framework, we assume that channels can be time-varying, but that they can be assumed constant 
over a sufficiently short observation length. The goal is hence to be able to work with extremely short samples. 

In order to face this challenge, we assume that the sensor array is structured, as in [60 1. More precisely, the sensor 
array is composed of a reference array containing n\ sensors, whose location is defined by a vector £ M 3 , and 
ri2 — 1 other subarrays, deduced from the reference array by a translation in space defined by a vector Aj £ R 3 , 
1 < j < ri2. The reference subarray is numbered with j = 1 in the remainder. 

Under these assumptions, the signal received at discrete time tk, k = 1, . . . , n^, on the ith sensor of the reference 
subarray can be written as: 

r 

Si,i(k) = y^crpfa)exp(V^p) 
P =i 

with ipi jP = ]^(hjA p ) where the dotless j denotes y— T, vector d p is unit norm and denotes the direction of 
arrival of the pth path. Next, on the jth subarray, j > 1, we have 

r 

s i,ji k ) = ^2 CT p(*fc) exp(V>ij : p) (58) 
P =i 

with ipij.p — j^(b^dp + Ajdp). If we let A x be the null vector, then ( |58] l also applies for the reference subarray. 
The interest of this structure is that variables i and j decouple in function exp^ij.p), yielding a relation resembling 
the rank revealing multilinear decomposition: 

r 
p=l 
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where u ip = exp (j^bjd p ), v jp = exp fjgAjdp) and w kp = a p (t k )/\\cr p \\, X p = \\cr p \\. 

Hence, by computing the rank revealing decomposition of the tensor S = {sij(k)) E c«axn 2 xn 3i ^ j s p 0SS jble 
to jointly estimate: (i) signal waveforms a p (k), and (ii) the directions of arrival d p of each propagation path if b; 
or Aj are known. 

However, the observation model ( |58j ) is not realistic, and an additional error term should be added in order to 
stand for modeling inaccuracies and background noise. It is customary (and realistic thanks to the central limit 
theorem) to assume that this additive error has a continuous probability distribution, so that tensor S has a generic 
rank. Yet, the generic rank takes values at least as large as \nin2n^/{ni + n 2 + n 3 — 2)], which is always larger 
than Kruskal's bound l20l . Therefore, we have to face the problem of approximating tensor S by another of rank r. 



And we have seen that the angular constraint imposed in Section |VIII| permits to deal with a well-posed problem. 
In order to see the physical meaning of this constraint, it is convenient to define first the tensor product between 
subarrays. 

B. Tensor product between sensor subarrays 

The sensor arrays we cope with are structured, in the sense that the whole array is generated by one subarray, 
defined by the collection of vector locations {b^ G M 3 : 1 < i < n%}, and a collection of translations in space, 
{A.j 6 M 3 : 1 < j < n 2 }- If we define vectors 

exp (j^bjdp) 
exp (j^Ajd p 



1 



"i 
1 



<T p 



i=l 

"2 



(59) 

3=1 



ll"pl 

then this means that we may see all measurements as the superimposition of decomposable tensors: 

XpU p <g> v p (g) w p . 

The geometry of the sensor array is contained in u p (g) v p , whereas \ p and w p contain energy and time information 
on each path p, respectively. Note that the reference subarray and the set of translations play symmetric roles, in 
the sense that u p and v p could be interchanged without changing the whole array. This will become clear with a 
few examples. 

When we are given a structured sensor array, there can be several ways of splitting it into a tensor product of 
two (or more) subarrays, as now shown by simple examples. 

Example 31. Define the matrix of sensor locations 

1 
1 1 

This subarray is depicted in Figure \l\b. By translating it according to the translation defined in Figure [7]c one 
obtains another subarray. The union of the two subarrays yields the array of Figure^a. The same array is obtained 



[bi,b 2 ,b 3 ] = 
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by interchanging the roles of the two subarrays, i.e. three subarrays of two sensors deduced from each other by 
two translations. 

(a) (b) (c) 

r / 




Fig. 1. Antenna array (a) is obtained as the tensor product between subarrays (b) and (c) 



Example 32. Define the array by 

12 12 

[bi,b 2 ,...,b 6 ] = 

1 1 1 

This array, depicted in Figure^a, can be obtained either by the union of subarray of Figure^b and its translation 
defined by Figure |2]c, or by the array of Figure |2]c translated three times according to Figure [2]£>. We agree to 
express this relationship by the equation: 



o — o o — o 



ra=n» 

Another decomposition may be obtained as 

In fact, = £ (g) and o— o— o = c~o (g> o— o. However, it is important to stress that the various decompositions 
of the whole array into tensor products of subarrays are not equivalent from the point of view of performance. In 
particular, the Kruskal 's bound can be different, as will be pointed out next. 



(a) 



(b) 



(c) 



Fig. 2. Antenna array (a) is obtained as the tensor product between subarrays (b) and (c) 



Similar observations can be made for grid arrays in general. 
Example 33. Take an array of 9 sensors located at (x, y) G {1, 2, 3} x {1, 2, 3}. We have the relations 



among others. 




January 1, 2013 



DRAFT 



31 



Let's now have a look at the maximal number of sources r max that can be extracted from a n\ x n 2 x n 3 . A 
sufficient condition is that the total number of paths, r, is smaller than Kruskal's bound §25) . We shall simplify 
the bound by making two assumptions: (a) the loading matrices are generic, that is, they are full rank, and (b) the 
number of paths is larger than the sizes n\ and n 2 of the two subarrays entering the array tensor product, and 
smaller than the number of time samples, 123. Under these simplifying assumptions, Kruskal's bound becomes 
2r max < rti + n 2 + r max - 2, or: 

^max = ni + n 2 - 2 (60) 
The table below illustrates the fact that the choice of subarrays has an impact on this bound. 



Array 


Sub array 
product 


n\ n 2 


^max 




V ® / 


3 2 


3 






4 2 


4 




2 3 


3 






3 3 


4 


ra®s 


6 2 


6 




4 4 


6 



C. Significance of the angular constraint 



We are now in a position to interpret the meaning of angular constraints proposed in Section VIII According to 
the notations given in ( |59l >, the first coherence 

jUi = max | ut 1 Ug | 

corresponds to the angular separation viewed from the reference subarray. In fact, vectors b; and d p having a unit 
norm, as well as vectors u p , the quantity |UpU g | may be seen as a measure of angular separation between d p and 



d 9 , as we shall now subsequently show in Proposition 35 



Definition 34. We shall say that a collection of vectors {t>i, . . . , b„} is resolvent with respect to a direction vel 3 
if there exist two indices k and I such that v = bfe — b; and 

X 

0< ||v|| < - 

where A = denotes the wavelength. 

Let hi, d p and u g be defined as in |59]l, 1 < i < m, 1 < p, q < n 2 . Then we have |[T9l : 
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Proposition 35. // {b 1; . . . , b„} is resolvent with respect to three linearly independent directions, then 

|u£u g | = 1 d p = d q . 

Proof: Assume lu^u^l = 1. Then because they are unit norm, vectors u p and u q are collinear with a unit 
modulus proportionality factor. Hence from (|59j>, for all j, k, 1 < j, k < n\, (hj — bfc) T (d p — d g ) <E XL, where 



A is defined in Definition 34 Since {bi, . . . , b„} is resolvant, there exist (fc, /) such that < ||bfc — bj|| < A/2. 
Hence, because vectors d p are unit norm, ||d p — d g | < 2 so that we necessarily have that (b^ — b;) T (d p — d q ) = 0. 
Vector (d p — d 9 ) is consequently orthogonal to (b& — b;). The same reasoning can be carried out with two other 
independent vectors. Eventually, vector (d p — d 9 ) is null because it is orthogonal to three linearly independent 
vectors in R 3 . The converse is immediate, by the definition of u g . ■ 



Note that the condition of Definition 34 is not very restrictive, since sensor arrays usually contain sensors 



separated by half a wavelength or less. Thanks to Proposition 35 we now know that uniqueness of the matrix 
factor U = [m, . . . j u r ] and identifi ability of the directions of arrival d p are equivalent. And from the results of 
Section |VIII| they are ensured by a constraint on coherence such as ( |48] i. 

From Section |IX-B| one can claim that a similar interpretation can be put forward for the second coherence, 
which measures the minimal angular separation between paths, viewed from the subarray defining translations. 

The third coherence is nothing else but the maximal correlation coefficient between signals received from various 
paths on the array: 

\"v"<t\ 

/i3 = max 



p¥=q \\<Tp\\\\a- q \\ 

As a conclusion, the tensor approximation exists and is unique if either signals propagating through various paths 
are not too much correlated, or if their direction of arrival are not too close. By "not too" it should be understood 
that the product of coherencies need to satisfy inequality ( |4"8j ) of Corollary [28] In other words, one can separate 
paths with high correlation provided they are sufficiently well separated in space. 

Hence, the decomposition of an array into a tensor product of two (or more) subarrays should not only take into 



account Kruskal's bound, as elaborated in Section IX-B but also the ability of the latter subarrays to separate two 
distinct directions of arrival (cf. Proposition |35| >. 

D. CDMA communications 



The application to antenna array processing we described in Section IX-A is one among many others, including 
Sparse Component Analysis and Compressed Sensing. Actually, the present framework also applies to all Source 
Separation problems, as those reported in lfl8ll . provided an additional diversity is available. For example, one can 
mention the case of Code Division Multiple Access (CDMA) communications. In fact, as already pointed out in 
ll6*Tl . it is possible to distinguish between symbol and chip diversities. In order to be more explicit, let's detail a 
little further the latter example. 

Consider a downlink CDMA communication with r users, each assigned a spreading sequence C p (k), 1 < p < r, 
1 < k < n. Next, denote Ai p the complex gain between sensor i, i = 1, . . . , m, and user p, Sj P the symbol sequence 
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transmitted by user p, j € Z, and H p (k) the channel impulse response of user p. Then, the signal received on 
sensor i during the fcth chip of the jth symbol period takes the form: 

r 
p=l 

where Bk p — Y^t Hp(k ~~ ^)Cp(^) denotes the output of the pth channel excited by the pth coding sequence, after 
removal of the guard chips (which may be affected by two different consecutive symbols) [61 1. 

The columns of matrix B are often referred to as "effective codes", and coincide with spreading codes if the 
channel is memoryless and noiseless. In practice, the receiver filter is matched to the transmitter shaping filter 
combined with the propagation channel, so that effective and spreading codes are ideally proportional. Under these 
conditions, the coherence nc accounts for the angular separation between spreading sequences: nc = means that 
they are orthogonal. On the other hand, /i^ = means that symbol sequences are all uncorrelated. Lastly, as seen 



in Proposition 35 hb = 1 means directions of arrival are collinear. 

Yet, in order to avoid multiple access interferences, spreading sequences are usually chosen uncorrelated for all 
delays, which implies among others that they are orthogonal. Hence the results obtained in this paper show that 
spreading sequences do not need to be orthogonal, nor symbol sequences need to be uncorrelated, if directions 
of arrival are not collinear. In particular, shorter spreading sequences may be used for the same number of users, 
which would increase the throughput. Alternatively, one could increase the number of users for a given spreading 
gain. This is made possible because the constraint of having almost orthogonal spreading sequences is relaxed. 
In addition, some directions of arrival may be collinear if the corresponding spreading sequences are sufficiently 
angularly separated. However, these conclusions are essentially valid when users are synchronized, that is, for 
downlink communications. 

E. Polarization 

The exploitation of polarization as an additional diversity takes its roots in the paper by Nehorai and Paldi 1531 . 
Several attempts to use this diversity in the frame of tensor-based source localization and estimation can be found 
in the literature. A good account of the existing approaches can be found in [34|. 

In this framework, we consider again an array of n\ sensors, whose location is defined by a 3-dimensional real 
vector hi, and we assume a narrow-band transmission in the far field (i.e., sources, or source paths, are all seen as 



plane waves at the receiver sensor array). The difference with Section IX-B is that the translation diversity is not 
mandatory anymore, provided impinging waves are polarized, and provided their polarization is neither linear nor 
circular. One measures the electric and magnetic fields at each sensor as a function of time, so that n 2 = 6. More 



precisely, vector v p of Equation (59 1 is replaced by: 

v P = B p g p (61) 



where B p is a 6 x 2 matrix depending only on the direction of arrival d p (defined in Section IX-B I, and a vector 
g p depending on the orientation and ellipticity of the polarization of the pth wave. 
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Coherences Hi and // 3 are the same as in Section IX-B and represent respectively the angular separation between 



1 








71 


{ " 




) 


— sin p 




cos 


9 p 


1 


fp = 










directions of arrival, and correlation coefficient between arriving sources. It is slightly more involved to capture the 
significance of the coherence associated with polarization, fi2- 

With this goal, we need to go into more details. Referring to [54|, and denoting a p and f3 p the orientation and 
ellipticity angles of the polarization of the pth wave, a p € (— 7r/2, 7r/2], f3 p G (— ir/4, 7r/4) — {0}, we have: 



g p = Q(a p )h p 



with 

cos 9 P sin x p 
sin p sin x P 
cos x P 

cos a sin a cos /3 P 

h p = 

— sin a cos a j sin /3 p 

where (? p £ [0,27r) and Xv ^ ( — 7r /2, tt/2] denote respectively the azimuth and elevation of the direction of arrival 
of the pth path. In particular, the unit vector defining the pth direction of arrival is: 

cos P cos Xp 
sin Op cos Xp 
sin x P 

so that the triplet (d p ,e p ,f p ) forms a right orthonormal triad. 

Lemma 36. |g p g q | = 1 if and only if a p — a q + kir and /3 p — f3 q , k 6 Z. 

Proposition 37. |vpV 9 | < 1, w/f/i equality if and only if a p = a q + kit, f3 p = j3 q , p = q + k'n and Xp = Xq> 

k, k' e Z. 



Q{a) 



Proofs are given in appendix. This proposition proves that a constraint on coherence ^ imposes sources paths 
to have either different directions of arrival or to have different polarizations. The constraint < 1 has hence a 
clear physical meaning. It is also interesting to note that /i2 < 1 contributes to satisfy /iti < 1, because /ii also 
involves directions of arrival. 



F. Fluorescence spectral analysis 

Here an another application to fluorescence spectral analysis l62l . We refer the reader to Example [6] for the nota- 
tions used here. Suppose we have I samples with an unknown number of pure substances in different concentration 
that are fluorescent. If a,-^ is the measured fluorescence emission intensity at wavelength A™ 1 of ith sample excited 
with light at wavelength XT. The measured data is a 3-dimensional hypermatrix A = (ciijk) & R /Xmx ™. At low 
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concentrations, Beer's law of spectroscopy (which in turn is a consequence of fundamental principles in quantum 
mechanics) can be linearized [49], and yields a rank-revealing decomposition: 

A = xi <£> yi ® zi H + x r (g y r ® z r . 

This can reveal the true chemical factors responsible for the data: r = rank(yl) would be the number of pure 
substances in the mixtures, x p = (%% p , . . . , xi p ) would be the relative concentrations of pth substance in specimens 
1, ...,/; y p = (yip, • • • , y m p) the excitation spectrum of pth substance; z p = (zi p , . . . , z np ) the emission spectrum 
of pth substance. The spectra would then allow one to identify the pure substances. 

Of course this is only valid in an idealized situation when the measurements are performed perfectly without 
error and noise. Under realistic noisy circumstances, one would then need to a find best rank-r approximation. 

G. Statistical independence induces diversity 

When measurements are recorded only as a function of two variables (e.g. time and space), the present framework 
can apply if at least one additional diversity is taken into account. We have seen already several instances of additional 
diversities in the previous subsections. We shall point out now a quite different way to build a function of more 
than two variables, which makes sense from the physical point of view. 

Assume the linear model below 

x(i) = Us{t) (62) 

where only signal x(t) is observed, U — [u 1; . . . , u r ] is an unknown N xr matrix, and s(t) is of dimension r with 
mutuallly statistically independent components, Sj(i). Then one can build the dth order cumulant tensor of x(£), 
T, and the latter will satisfy the multilinear model [52]: 

r 

T = (U,..., U) ■ A = X p u p ® • • • ® u p 

P =i 

where A denotes the cumulant tensor of s. Yet, because of statistical independence, A is diagonal 11811 . If d > 3, 
and if at most one entry of A is null, then matrix U and the entries A p ... p = X p can be identified. The uniqueness 
of the solution is subject to identifiability conditions evoked earlier. 

As pointed out in [18| and references therein, this kind of problems generalizes to convolutive mixtures as well, 
and finds applications in telecommunications, radar, sonar, speech processing, and biomedical engineering, among 
others. 

H. Nonstationarity induces diversity 

If a signal x(t) is nonstationary, its time-frequency transform, defined by 

X(t,f) = J x(u)k(u — t; /) du 

for some given kernel k, bears information. If variables t and / are discretized, then the values of X(t, f) can be 
stored in a matrix X, and the more nonstationary x(t), the larger the rank of X. A similar statement can be made 
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on a signal y(z) depending on a space variable z. The discrete values of the space-wavevector transform Y(z,w) 
of a field y(z) can be stored in a matrix Y. And the less homogeneous the field y(z), the larger the rank of Y. 
This is probably the reason why algorithms proposed in ll69l . Q permit to localize and extract dipole contributions 
in the brains by identifying a multilinear model, provided the have distinct time-frequency or space-wavevector 
patterns. Nevertheless, localization is guaranteed to be successful only under restrictive assumptions. 



X. Future work 

A separate article discussing practical algorithms for the bounded coherence best rank-r multilinear approxi- 
mation is under preparation with additional coauthors. These algorithms follow the general strategy of the greedy 



approximations WOGA and OGA discussed in Sections VIII-D and VIII-E but contain other elements exploiting the 
special separable structure of our problem. Extensive numerical experiments will be provided in the forthcoming 
article. 



Appendix 



Proof of Lemma 36 First note that Q(a p ) H Q(a q ) = Q(a q — a p ). Hence g p g q can be of unit modulus only 
if hp and Q(a q — a p )h q are collinear. But the first entry of h p is real and the second is pure imaginary. Hence, 
corresponding imaginary and real parts of Q(a q — a p )h q must be zero, which implies that sin(a 9 — a p ) = 0. 
Consequently Q(a q — a p ) — ±1, which yields h p = ±h g . But because angle lies in the interval (— 7r/4, 7r/4), 
only the positive sign is acceptable. ■ 



Proof of Proposition 37- We have |VpV g | = \g p BjB q g q \. Notice that matrix BjB q is of the form 



Bj B q = 



7 n 

-V 7 



where 7 and r\ are real, 7 = ^(eje q + fjfq) and 77 = |(elf g — f p e q ). Yet, since vectors g p and g q are of unit 
modulus, |vpV 9 | can be of unit modulus only if matrix B^B q has an eigenvalue of unit modulus, which requires 
that 7 2 + T] 2 = 1. Let us prove that we have -f 2 + rj 2 < 1 with equality if and only if results of Proposition 37 are 
satisfied. 

With this goal, define the three 6-dimensional vectors: 



1 

71 



1 

71 



w = 



1 

71 



-e„ 



Then 7 = z T w and 7 = z T w'. Now decompose vector z into two orthogonal parts: z = zq + z\, with zo 6 
span{w, w'} and Zo-Lzi. Clearly, 7 2 + rf = ||zo|| 2 . It is also bounded by one because ||zo|| 2 < ||z|| 2 = 1, with 
equality if and only if z £ span{w,w'}. By inspection of the definitions of e p and e q , we see that the third entry 
of z and w is null. Hence z € span{w,w'} is possible only if either z is collinear to w or if the third entry of 
w' is null. In the latter case, it means that \q — 7r /2, and then that \ p — -Kj2 and 8 p = 8 q . In the former case, it 
can be seen that sin6> p = sin6* g , and finally that \p = Xq- 
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The last step is to rewrite 7 and r\ as a function of angle 6 p — 6 q , using trigonometric relations: 7 = cos(f? p — 
9q)(l + sinxp sinxg) + cosxp cos \ q and 77 = sin(0 p — 6> g )(sinx p + sinx 9 )- This eventually shows that 7 = 1 and 
77 = 0. As a consequence, |v p v g | = 1 only if BjB q = I, and the proposition follows by applying the lemma. ■ 
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